Polar codes are capacity-achieving linear block codes with simple encoding and
decoding routines. The successive cancellation (SC) algorithm is one of the
predominantly used decoding algorithms because of its low complexity, and it
offers broad scope for hardware architecture design and reformulation. For
polar codes, the trade-off between the long latency and the silicon area of
the SC algorithm is a bottleneck in the design of a high-throughput polar
decoder. Prior SC polar decoder designs have higher area requirements at
larger block lengths. This paper introduces a reformulation of the processing
element (PE) block of SC decoding. The proposed reformulation leads to two
benefits. Firstly, the critical path and hardware complexity of the PE are
reduced by using a unified adder block. Secondly, the silicon area requirement
and the power consumption are reduced considerably without any loss in
performance. The proposed PE is used to build the decoder for various block
lengths. Moreover, a gate-level analysis of the proposed decoder has revealed
that the design attains an 18% area reduction and a 38% reduction in power
consumption over the conventional one with similar performance.
1. INTRODUCTION

Errors owing to interference, device impairments, and random noise corrupt the data stream before it
reaches the receiver. These errors need to be corrected before the data are used at the receiver.
Channel coding techniques help to minimize the errors and rectify them to reconstruct the original data.
Channel coding modifies the original data stream at the transmitter to minimize error occurrence and
reverses the modification at the receiver to reconstruct the original signal. The term encoding denotes
the operations at the transmitter, and decoding denotes the processes at the receiver. The prime focus is
to develop high-performance channel codes with adequately low complexity that diminish the impact of
errors in a communication system. This allows timely, practical implementation in the silicon
technology of the day.

Proposed by E. Arikan, polar codes stand as one of the finest capacity-achieving codes, with low
encoding and decoding complexity of the order O(N log N), where N is the code length [1]. Polar codes
operate on blocks of bits and are therefore classified under the block code family. They are built on a
recursive concatenation of short kernel functions that convert physical channels into virtual channels.
As the virtual channel count increases, the channels tend toward either low or high reliability, i.e.,
they polarize. This feature enables the message bits to be allotted to the most reliable channels. Arikan
put forth the first capacity-achieving error correcting codes (ECC) for binary-input discrete memoryless
channels [1]. Since then, polar codes have emerged as one of the prominent coding schemes. They are
known for encoding and decoding architectures that have low complexity and a recursive structure. They
are constructed based on channel polarization, where message bits are transmitted through the most
reliable channels [1]. The widely used successive cancellation decoder (SCD) has an attractive
architecture with low hardware complexity of the order O(N log N), where N is the block length.

Several architectures have been proposed for the SCD. The SCD has three major blocks: the processing
element (PE), the partial sum generation unit (PSG), and the memory unit (MU). In this work, an
area-efficient architecture for the PE, and hence the decoder, is proposed. An architectural modification
of the PE reduces the overall area. The SCD is a widely used algorithm that delivers good
error-correcting performance in polar code decoding, but the hardware architectures of the conventional
SCD occupy a large silicon area. Improvements in polar decoding techniques and systems are eagerly
awaited by the 5G telecommunication industry, where increasingly complex systems are being developed
that require less power and area while achieving high throughput and speed. The challenge in 5G
wireless systems is to provide a channel coding scheme that supports increasing spectral efficiency.
Research on the SCD remains active, and it is therefore worthwhile to propose a new architecture for it.

The original SCD was proposed in [1], and derivative versions of the decoder have since emerged with
improved performance. Despite the availability of other decoding schemes, the SCD stands out for its
simple architecture and easy construction. Yuan et al. implemented a two-bit low-latency polar decoder
using a reformulation technique [2]. Leroux et al. used scheduling and allocation methods to reduce the
hardware complexity of the decoder [3]. The logarithmic implementation [3], [4] significantly reduced
the hardware complexity. The unrolled architecture proposed by Giard et al. achieved high throughput
[5]. A merged processing element for the SCD based on one's complement was proposed in [6], which
helped to increase the throughput. In [7], a simplified merged processing element (SMPE) architecture
was proposed for the SCD, and the decoding tree was constructed with low complexity. The SCD achieves
the best throughput and energy per bit [8] compared to other decoders. Polar codes have been verified to
be capacity achieving for binary-input symmetric memoryless channels [9]. The combinational logic
based SCD achieved low power consumption [10]. A non-recursive method to create a decoding schedule
without affecting the performance has been presented in [11].

In [12], a decoder aided by a path splitting selection (PSS) strategy was proposed to lessen the
decoding complexity with negligible performance loss. Based on PSS, two schemes were suggested to
locate erroneous information bits more precisely. Apart from the PE and PSG, the memory unit is another
sub-block that can be optimized according to the requirement [13]. The PE operates on log-likelihood
ratios (LLR) instead of likelihood ratios (LR). Polar codes are preferred over low-density parity-check
(LDPC) codes on wireless communication channels due to their better performance [14]. There are
rate-less polar code implementations [15], [16] in which the number of frozen bits varies with the
length of the information bits. This feature makes the decoder extensible to various code rates [17].
Polar codes have been extended to non-identically distributed channels and were found to be capacity
achieving, but at the expense of latency and complexity [18].

Notably, the 5G standardization process of the 3rd Generation Partnership Project (3GPP) has selected
polar codes as a channel coding scheme. Different architectural modifications of the conventional SC
approach have been proposed to improve characteristics such as throughput and power. In the same
spirit, this paper presents a reduced-area SCD with low power consumption. On careful examination of
the 2-bit SCD [2], it was evident that the g node in the PE performs addition and subtraction in
parallel, and one of the outputs is chosen based on the previously decoded bit of the f node in the PE.
One of these operations is therefore redundant and can be eliminated. The idea of eliminating the idle
branch led to removing the subtractor from the PE and carrying out the subtraction with the existing
adder. The hybrid processing element proposed here is used to build the SCD. Further, gate optimization
is carried out in the proposed decoder, and a noticeable improvement was achieved. The total area of the
decoder was reduced by 18%, and the power was reduced by 38%, compared with recent architectures.
Throughout, the functionality of the SCD remained the same as that of the original SCD in [1].

The remainder of the paper is arranged as follows: Section 2 describes the encoding and decoding of
polar codes and the proposed system design, consisting of the proposed hybrid processing element and
the decoder. Section 3 discusses the implementation results, and the paper is concluded in Section 4.
2. RESEARCH METHOD

Polar codes are linear block codes and one of the available forward error correction (FEC) codes. They
have low computational complexity and are capacity achieving. A polar code with rate R = I/N has a code
length of $N = 2^n$, where I (0 < I < N) denotes the number of information bits and N-I denotes the
number of frozen bits added to the information bits. The construction of polar codes follows a simpler
approach than traditional codes such as turbo codes. Encoding occurs at the transmitter, where the
channels are polarized into reliable and unreliable channels. The information bits are transmitted on
the I most reliable channels, and the remaining N-I channels are frozen and set to 0.


2.1. Polar code-encoding 
The encoding of polar codes is given by (1) [3].

$$x_1^N = u_1^N G_N = u_1^N B_N F^{\otimes n} = u_1^N F^{\otimes n} B_N \qquad (1)$$

where $G_N = B_N F^{\otimes n}$ denotes the generator matrix, $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$,
$(\cdot)^{\otimes n}$ represents the nth Kronecker power, $B_N$ denotes the bit-reversal permutation
matrix, and the input vector $u_1^N = (u_1, u_2, \ldots, u_N)$ has I information bits and (N-I) frozen
bits. Also, $x_1^N = (x_1, x_2, \ldots, x_N)$ represents the encoded vector [12]. The positions of the
frozen bits are identified by the method explained in [4]. The generator matrix for N=8 is given as (2)
and (3).


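$$G_8 = B_8 F^{\otimes 3} \qquad (2)$$

$$G_8 = B_8 \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{bmatrix} \qquad (3)$$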


Once $G_N$ is available, the input vector is encoded and prepared for transmission. For rate R = 0.5, the
number of information bits will be equal to the number of frozen bits. 
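
As an illustration of (1) to (3), the following minimal Python sketch builds the Kronecker power of F and applies the bit-reversal permutation before the matrix product over GF(2). The function names (`kronecker_power`, `bit_reverse`, `polar_encode`) are hypothetical and chosen for this example only; the sketch assumes the input vector already contains the frozen bits at their positions.

```python
def kronecker_power(F, n):
    """Return the n-th Kronecker power of the binary matrix F as a list of rows."""
    result = [[1]]
    for _ in range(n):
        # result (x) F: every row of result is expanded by every row of F
        result = [
            [a * b for a in row_r for b in row_f]
            for row_r in result
            for row_f in F
        ]
    return result


def bit_reverse(i, n):
    """Reverse the n-bit binary representation of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def polar_encode(u):
    """Encode u (length N = 2^n, frozen bits included) as x = u * B_N * F^(x)n over GF(2)."""
    N = len(u)
    n = N.bit_length() - 1
    if N < 2 or (1 << n) != N:
        raise ValueError("length must be a power of two and at least 2")
    G = kronecker_power([[1, 0], [1, 1]], n)
    # Multiplying by the permutation matrix B_N reorders the input bits
    u_perm = [u[bit_reverse(j, n)] for j in range(N)]
    return [sum(u_perm[i] * G[i][j] for i in range(N)) % 2 for j in range(N)]


if __name__ == "__main__":
    for row in kronecker_power([[1, 0], [1, 1]], 3):
        print(row)  # rows of the 8x8 matrix in (3)
    print(polar_encode([0, 1, 0, 0, 0, 0, 0, 0]))
```

The matrix is built explicitly here only to mirror (3); a hardware or production encoder would normally use the butterfly structure instead of a dense product.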


2.2. Polar code-decoding 
The decoding is carried out with the widely used successive cancellation decoding algorithm [19],
[20]. In the decoding scheme, the information bits $\hat{u}_1^N = (\hat{u}_1, \hat{u}_2, \ldots, \hat{u}_N)$
are retrieved sequentially from the received vector $y_1^N = (y_1, y_2, \ldots, y_N)$. The output bits
at stage t can be decoded by processing the LLR function as (4) and (5).

$$\hat{u}_i = \Theta(LLR(i, t)) \qquad (4)$$

where

$$\hat{u}_i = \begin{cases} 1, & \text{if } LLR(i, t) < 0 \text{ and } i \text{ is free} \\ 0, & \text{if } LLR(i, t) \geq 0 \text{ or } i \text{ is frozen} \end{cases} \qquad (5)$$
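
The decision rule in (4) and (5) can be sketched in a few lines of Python. The function name `hard_decision` is hypothetical, and the sketch assumes that a frozen position always decodes to 0 and that an LLR of exactly zero maps to 0.

```python
def hard_decision(llr, is_frozen):
    """Return the decoded bit for one position from its final-stage LLR."""
    if is_frozen:
        return 0  # frozen bits are known to be 0 at the receiver
    return 1 if llr < 0 else 0


if __name__ == "__main__":
    print(hard_decision(-2.5, False))  # free bit with negative LLR
    print(hard_decision(-2.5, True))   # frozen bit ignores the LLR
```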


Both encoder and decoder facilitate well-ordered processing configurations and appropriate module 
sharing owing to their recursive build. Every stage is composed of the kernels f and g appropriately 
scheduled to decode the data. The f and g kernels function based on (6) and (7). 


$$f = \mathrm{sign}(c)\,\mathrm{sign}(d)\,\min(|c|, |d|) \qquad (6)$$

where, in the kernel g of (7), $\hat{u}_{sum}$ decides between addition and subtraction, and c and d
represent the LLR inputs. $\hat{u}_{sum}$ represents the partial sum of the subset of previously decoded
bits. The conventional decoding tree is shown in Figure 1.
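
A minimal Python sketch of the two kernels follows. The f kernel implements (6) directly. For the g kernel, the sketch assumes the convention of Table 2 and the worked example in Section 2.3, i.e., c + d when the partial sum is 0 and c - d when it is 1; other papers order the operands differently. The names `f_kernel` and `g_kernel` are hypothetical.

```python
def sign(x):
    """Sign of x as +1 or -1, treating zero as positive."""
    return -1 if x < 0 else 1


def f_kernel(c, d):
    """Min-sum f kernel of (6): product of signs times the smaller magnitude."""
    return sign(c) * sign(d) * min(abs(c), abs(d))


def g_kernel(c, d, u_sum):
    """g kernel (assumed convention): add when u_sum is 0, subtract when it is 1."""
    return c - d if u_sum else c + d


if __name__ == "__main__":
    print(f_kernel(3, -5))
    print(g_kernel(3, -5, 0), g_kernel(3, -5, 1))
```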


Figure 1. Conventional decoding tree (Stage 3, Stage 2, Stage 1)


In Figure 1, the highlighted path is the decoding path for bit $\hat{u}_1$. This takes three clock
cycles. The calculation of $\hat{u}_2$ occurs in the next clock cycle through the corresponding g
kernel, which operates with $\hat{u}_{sum} = \hat{u}_1$. At the end of 14 cycles, $\hat{u}_1$ to
$\hat{u}_8$ are decoded. The decoding schedule for a code block length of 8 is presented in Table 1.
Thus, the conventional SCD takes 2N-2 clock cycles to decode a code of length N [1]. Po et al. [21]
proposed a polar code decoder with variable R and N.


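Table 1. Decoding schedule of the conventional SCD for N=8

Clock cycle   S1   S2   S3   Output
1             f    -    -    -
2             -    f    -    -
3             -    -    f    $\hat{u}_1$
4             -    -    g    $\hat{u}_2$
5             -    g    -    -
6             -    -    f    $\hat{u}_3$
7             -    -    g    $\hat{u}_4$
8             g    -    -    -
9             -    f    -    -
10            -    -    f    $\hat{u}_5$
11            -    -    g    $\hat{u}_6$
12            -    g    -    -
13            -    -    f    $\hat{u}_7$
14            -    -    g    $\hat{u}_8$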
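
The conventional schedule of Table 1 corresponds to a depth-first walk of the decoding tree in which every stage runs its f kernel, descends, then runs its g kernel and descends again. The following Python sketch, with the hypothetical name `sc_schedule`, generates that sequence with one kernel per clock cycle and confirms the 2N-2 cycle count. It models only the ordering, not the LLR arithmetic.

```python
def sc_schedule(n_stages):
    """Return one (stage, kernel, decoded_bit) tuple per clock cycle for N = 2**n_stages.

    decoded_bit is the 1-based index of the bit produced in that cycle,
    or None when the cycle only updates intermediate LLRs.
    """
    steps = []
    bit = 0

    def visit(stage):
        nonlocal bit
        for kernel in ("f", "g"):
            if stage == n_stages:
                bit += 1  # the last stage emits one decoded bit per kernel call
                steps.append((stage, kernel, bit))
            else:
                steps.append((stage, kernel, None))
                visit(stage + 1)

    visit(1)
    return steps


if __name__ == "__main__":
    for cycle, (stage, kernel, bit) in enumerate(sc_schedule(3), start=1):
        print(cycle, f"S{stage}", kernel, "" if bit is None else f"u{bit}")
```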
2.3. Proposed new hybrid processing element

The proposed decoder uses an area-optimized combinational logic scheme that eliminates the
unnecessary subtraction operation carried out in the PEs. The proposed PE is termed the new hybrid
processing element (NHPE) and is used to design the decoder. The conventional SCD has the f node and
the g node, which form the decoder's processing element and are scheduled appropriately to form the
decoding tree. With the development of the merged processing element, the f and g operations are
selected using a control signal, and the decoding tree is constructed accordingly. Research on better
merged processing elements continued and evolved into 2b-SC decoding, the 2b overlapped SC decoder,
the 2b precomputation decoder, and so on [2]. These architectures were optimized for area, power, and
speed. This paper proposes a novel area-optimized processing element with low power consumption
compared with the existing processing elements.

On careful examination of the merged processing element in [2], it was noted that during the g node
operation, both addition and subtraction are performed in every iteration. For correct decoding,
however, only one operation is needed, and it is selected by the partial sum of the previously decoded
bits as in (7). The proposed NHPE uses combinational logic to eliminate the subtractor throughout the
architecture, along with the 2x1 multiplexer that chooses between addition and subtraction. This
modification in the NHPE is reflected in the whole decoder. In the proposed scheme, the f node operates
in the conventional way, with the sign and magnitude processed separately. The modification is made in
the g node operation. As soon as the f node is executed, the partial sum is generated, which drives the
adder of the g node through an XOR gate. When subtraction is to be performed, the sign bit of LLR(d) is
inverted and addition is performed. Table 2 shows the truth table for the proposed idea.


Table 2. Truth table of the proposed scheme

Sign of d      Value of $\hat{u}_{sum}$   Operation to be performed   Modified sign of LLR(d)
Positive (0)   0                          Addition                    0
Negative (1)   0                          Addition                    1
Positive (0)   1                          Subtraction                 1
Negative (1)   1                          Subtraction                 0

The following example explains the operation.

If $\hat{u}_{sum}$ = 0, i.e., addition is to be performed:

Let c = +3 and d = -5.

Sign_rep(c) = 0 0011, Sign_rep(d) = 1 0101

New sign bit(d) = XOR of previous sign bit(d) and $\hat{u}_{sum}$ = XOR(1, 0) = 1

2s_comp_rep(c) = 0 0011,

2s_comp_rep(modified d) = 1 1011,

c+d = 1 1110, Sign_rep(c+d) = 1 0010, i.e., -2.

If $\hat{u}_{sum}$ = 1, i.e., subtraction is to be performed:

Let c = +3 and d = -5.

Sign_rep(c) = 0 0011, Sign_rep(d) = 1 0101

New sign bit(d) = XOR of previous sign bit(d) and $\hat{u}_{sum}$ = XOR(1, 1) = 0

Modified d = 0 0101.

2s_comp_rep(c) = 0 0011,

2s_comp_rep(modified d) = 0 0101,

c+d = 0 1000, Sign_rep(c+d) = 0 1000, i.e., +8.

The above examples illustrate Table 2 and show that the subtractor and the multiplexer are no longer
needed. The NHPE is shown in Figure 2. The architecture shows that the subtraction operation is
eliminated and a hybrid processing element is obtained.
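
The sign-bit manipulation of Table 2 can be checked with a short behavioural model. The Python sketch below is not the authors' RTL; it assumes 5-bit words (one sign bit and four magnitude bits, as in the worked example), and the helper names (`to_sign_magnitude`, `to_twos_complement`, `from_twos_complement`, `nhpe_g`) are hypothetical.

```python
WIDTH = 5  # 1 sign bit + 4 magnitude bits, as in the worked example


def to_sign_magnitude(x):
    """Split an integer LLR into (sign_bit, magnitude)."""
    return (1 if x < 0 else 0, abs(x))


def to_twos_complement(sign_bit, magnitude, width=WIDTH):
    """Convert a sign-magnitude pair into a width-bit two's complement word."""
    value = -magnitude if sign_bit else magnitude
    return value & ((1 << width) - 1)


def from_twos_complement(word, width=WIDTH):
    """Interpret a width-bit two's complement word as a signed integer."""
    return word - (1 << width) if word & (1 << (width - 1)) else word


def nhpe_g(c, d, u_sum, width=WIDTH):
    """g node with a single adder: the sign bit of d is XORed with u_sum, then c and d are added.

    Results outside the width-bit range wrap around, as they would in a fixed-width adder.
    """
    sign_c, mag_c = to_sign_magnitude(c)
    sign_d, mag_d = to_sign_magnitude(d)
    new_sign_d = sign_d ^ u_sum  # Table 2: modified sign of LLR(d)
    total = to_twos_complement(sign_c, mag_c, width) + to_twos_complement(new_sign_d, mag_d, width)
    return from_twos_complement(total & ((1 << width) - 1), width)


if __name__ == "__main__":
    print(nhpe_g(3, -5, 0))  # addition case of the worked example
    print(nhpe_g(3, -5, 1))  # subtraction case of the worked example
```

With inputs limited to the range -7 to 7, the 5-bit result never overflows, so the model agrees with plain c + d and c - d over that range.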
Figure 2. Proposed processing element-NHPE (inputs LLR(c) and LLR(d); minimum of |c|, |d| block; Ctrl select; f/g output)


2.4. Proposed SCD based on NHPE 

The proposed NHPE is used to build the decoder, and the modifications in the NHPE have a greater
impact when realized across the entire decoder. In the first cycle, all the PEs of stage 1 work as f
nodes. In the second cycle, all the PEs of stage 2 work as f nodes, and in the third cycle, the PE in
stage 3 decodes the output message bits. This decoding sequence takes a total of (1.5N-2) clock cycles.
Pipelining is introduced between the stages to speed up the decoding process, which reduces the
decoding latency from 1.5N-2 to N-1 cycles. The f node has a delay of $T_f$ = 0.709 ns, whereas the g
node has a delay of $T_g$ = 1.86 ns. The other delays involved are $T_{mux}$ = 0.136 ns and
$T_{xor}$ = 0.148 ns. In stage 3, once the f node completes its operation, the partial sum value
$\hat{u}_{sum}$ is updated, which completes the execution of the g node. The decoding tree for a block
length of 8 is shown in Figure 3.
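
To make the latency figures concrete, the short Python sketch below evaluates the three cycle counts quoted in the text (2N-2 for the conventional SCD, 1.5N-2 without pipelining, and N-1 with pipelining). The function name `decoding_latency` and the dictionary keys are hypothetical labels for this example.

```python
def decoding_latency(N):
    """Return the decoding latency in clock cycles for block length N (a power of two)."""
    if N < 2 or N & (N - 1):
        raise ValueError("N must be a power of two and at least 2")
    return {
        "conventional_sc": 2 * N - 2,            # one kernel per cycle
        "proposed_unpipelined": 3 * N // 2 - 2,  # 1.5N - 2
        "proposed_pipelined": N - 1,
    }


if __name__ == "__main__":
    for N in (8, 64, 1024):
        print(N, decoding_latency(N))
```

For N = 8 this gives 14, 10, and 7 cycles, and for N = 1024 the pipelined figure is 1023 cycles.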


Figure 3. Proposed SC decoder for block length 8 (Stage 3, Stage 2, Stage 1; channel inputs LLR(1,0), LLR(2,0), ...)


The use of a less complex processing element (NHPE) reduces the overall complexity of the decoder, and
the area and power are reduced considerably. The decoding schedule of the proposed decoder is shown in
Table 3. NHPE(f) denotes the proposed processing element functioning in f mode, and NHPE(g) denotes the
proposed processing element functioning in g mode. The proposed scheduling enables the decoder to
complete the process in 7 cycles for N=8. Thus, the proposed NHPE not only reduces the on-chip area but
also reduces the latency of the decoder.


Table 3. Decoding schedule of the proposed SCD for N=8

Clock cycle   Active PE(s), stages S1 to S3   Output
1             NHPE(f)                         -
2             NHPE(f)                         -
3             NHPE                            $\hat{u}_1$ and $\hat{u}_2$
4             NHPE(g), NHPE(g)                $\hat{u}_3$ and $\hat{u}_4$
5             NHPE(g)                         -
6             NHPE(f)                         $\hat{u}_5$ and $\hat{u}_6$
7             NHPE(g)                         $\hat{u}_7$ and $\hat{u}_8$


3. RESULTS AND DISCUSSION 

The proposed SCD was modelled in Verilog HDL, implemented on a Xilinx xc7vx980t FPGA, and synthesized
in Cadence 45nm CMOS technology. The proposed NHPE was used to construct the decoding tree, and the
decoder was tested up to a code length of 1024. A code rate of 0.5 was used in the design. The area and
power of the proposed model compared to the existing models [4], [22] are given in Tables 4 and 5.
Considerable area and power reduction was achieved in the decoder architecture when the proposed NHPE
was used. The functionality of the decoder was unaltered in the proposed architecture. Table 4 shows
that the area of the proposed model was reduced by at least 16.5% on average.


Table 4. Proposed SCD area comparison report using TSMC 45nm technology

Block length (N)   Area (µm²)                                              Reduction (%)
                   Yuan and Parhi [4]   Zhang and Parhi [22]   Proposed
8                  966.718              3621
64                 13917.83             30141
128                31980.37             60899
256                72258.24             121925
512                161123.60            243891
1024               355478.48            444944


Table 5. Proposed SCD power comparison report using TSMC 45nm technology

Block length (N)   Power (µW)                                              Reduction (%)
                   Yuan and Parhi [4]   Zhang and Parhi [22]   Proposed
8                  82.02                217.89                 49.92       39.133
64                 1248.17              1725.03                763.04      38.867
128                2828.38              3567.91                1730.83     38.80
256                6620.32              7049.74                4063.27     38.62
512                14012.20             14002.70               8632.40     38.392
1024               30909.90             27220.40               19104.85    38.19
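
The reduction column of Table 5 appears to be computed relative to the design of Yuan and Parhi [4]. The following Python sketch recomputes it under that assumption; the dictionary `POWER_UW` simply restates the table values, and the function name `reduction_percent` is hypothetical.

```python
# Power in microwatts from Table 5: block length -> (Yuan and Parhi [4], proposed)
POWER_UW = {
    8: (82.02, 49.92),
    64: (1248.17, 763.04),
    128: (2828.38, 1730.83),
    256: (6620.32, 4063.27),
    512: (14012.20, 8632.40),
    1024: (30909.90, 19104.85),
}


def reduction_percent(reference, proposed):
    """Relative reduction of the proposed value with respect to the reference, in percent."""
    return 100.0 * (reference - proposed) / reference


if __name__ == "__main__":
    values = []
    for n, (ref, prop) in POWER_UW.items():
        r = reduction_percent(ref, prop)
        values.append(r)
        print(f"N={n}: {r:.2f}%")
    print(f"average: {sum(values) / len(values):.2f}%")
```

The recomputed values agree with the tabulated reductions to within rounding, and their average is between 38% and 39%, consistent with the 38% figure quoted in the text.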


This is the improvement brought by the proposed architecture. Also, Table 5 shows a 38% power
reduction on average, which is a considerable improvement. Thus, the proposed processing element
(NHPE), and therefore the proposed SCD, show a reduction in area and power compared to the available
decoders. Table 6 shows the hardware performance of the proposed decoder. The performance of the
proposed model was compared to other existing architectures at the same code length and code rate. The
performance of the proposed decoder is competitive with the other existing decoders. The proposed
decoder achieved the best hardware efficiency [23] among the compared models and also good normalized
throughput. The technology scaled normalized throughput (TNST) helps to compare the throughput across
various technologies and is defined by [24],


$$TNST = \frac{\text{Throughput}}{\text{Gate Count}} \times \frac{\text{Technology}}{\text{Target Technology}} \qquad (8)$$
This helps to scale the throughput based on the target technology. From Tables 4, 5, and 6, it is
evident that the proposed decoder stands out significantly when compared to the state-of-the-art
architectures in terms of low area, low power, and higher efficiency without altering the performance.
Hence, it can be used for area- and power-efficient communication applications.


Table 6. Hardware performance of decoder with code length, N=1024

Design                   Yuan and Parhi [4]   Leroux et al. [25]   Kim et al. [6]   Proposed
Message form             LLR                  LLR                  LLR              LLR
Code rate                0.5                  0.5                  0.5              0.5
Tech. (nm)               65                   65                   40               45
Quantization (bits)      6                    5                    5                5
Frequency (MHz)          390                  500                  1000             400
Latency (clock cycles)   1056                 2080                 1023             1023
Area (µm²)               355478.48            308693               587070           300717.77
Throughput (Mbps)        378                  123                  500              490
TNST                     2                    1.65                 3.03             3.1
Efficiency (Mbps/mm²)    1064                 398                  851.78           1633
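
The efficiency row of Table 6 can be approximately reproduced by dividing throughput by area, as its unit (Mbps/mm²) suggests. The Python sketch below makes that assumption explicit; the dictionary `DESIGNS` restates the table values, and the function name `efficiency_mbps_per_mm2` is hypothetical.

```python
# (throughput in Mbps, area in square micrometres) from Table 6
DESIGNS = {
    "Yuan and Parhi [4]": (378, 355478.48),
    "Leroux et al. [25]": (123, 308693),
    "Kim et al. [6]": (500, 587070),
    "Proposed": (490, 300717.77),
}


def efficiency_mbps_per_mm2(throughput_mbps, area_um2):
    """Hardware efficiency, assumed here to be throughput divided by area in mm^2."""
    area_mm2 = area_um2 / 1e6  # 1 mm^2 = 10^6 square micrometres
    return throughput_mbps / area_mm2


if __name__ == "__main__":
    for name, (tp, area) in DESIGNS.items():
        print(f"{name}: {efficiency_mbps_per_mm2(tp, area):.1f} Mbps/mm^2")
```

The recomputed values land within about half a percent of the tabulated ones, so small differences are presumably due to rounding of the published throughput or area.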


4. CONCLUSION 

In this paper, a low-power, area-efficient architecture for the SCD was proposed. A new hybrid
processing element was presented, which occupies less area and consumes less power than existing
architectures. The proposed processing element was used as the basic building block to construct the
decoder. The proposed architecture also had better hardware efficiency. It can be combined with
look-ahead techniques to further reduce the latency and speed up the decoder. The analysis and
implementation show that the proposed decoder has clear benefits in terms of area and power. The
proposed decoder leaves the decoding functionality unaltered and finds application in high-speed
communication systems, including 5G.